Project 4 explores various approaches for implementing a movie recommendation model. Three model approaches were studied that recommend movies based on genre, reviewer similarity, and movie similarity. System I models make predictions based on genres, with the code and results included below. System II models make predictions based on user-rating and movie-rating similarities. The model implementations matched recommenderlab results within the goals of the assignment.


Data Analysis

The Project 4 data set contains about 1 million anonymous ratings of approximately 3,900 movies made by 6,040 users who joined MovieLens in 2000. The data set consists of two files with the following structures.

Ratings Dataset

cat(str(ratings))
## 'data.frame':    1000209 obs. of  4 variables:
##  $ UserID   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MovieID  : int  1193 661 914 3408 2355 1197 1287 2804 594 919 ...
##  $ Rating   : int  5 3 3 4 5 3 5 5 4 4 ...
##  $ Timestamp: int  978300760 978302109 978301968 978300275 978824291 978302268 978302039 978300719 978302268 978301368 ...

Movies Dataset

## 'data.frame':    3883 obs. of  4 variables:
##  $ MovieID: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ Genres : chr  "Animation|Children's|Comedy" "Adventure|Children's|Fantasy" "Comedy|Romance" "Comedy|Drama" ...
##  $ Year   : num  1995 1995 1995 1995 1995 ...
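For orientation, the join and genre-split logic used throughout the report can be mirrored outside R. Below is a small pandas sketch with toy rows (column names follow the structures above; the data values are invented):

```python
import pandas as pd

ratings = pd.DataFrame({"UserID": [1, 1], "MovieID": [1, 2], "Rating": [5, 3]})
movies = pd.DataFrame({
    "MovieID": [1, 2],
    "Title": ["Toy Story (1995)", "Jumanji (1995)"],
    "Genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Join movie details onto ratings (R: left_join), then split the
# pipe-delimited Genres into one row per genre (R: separate_rows)
joined = ratings.merge(movies, on="MovieID", how="left")
joined = joined.assign(Genres=joined["Genres"].str.split("|")).explode("Genres")

print(len(joined))  # 6 rows: one per (rating, genre) pair
```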

Exploratory plots (not reproduced in this rendering) summarize characteristics of the data set.



System I: Recommendation based on genres

Two methods for making recommendations using only the requester's preferred genre were designed and tested. Scheme A uses overall popularity, while Scheme B designates “genre experts” to pick movies.

Data Set Description

System I uses the data set described in the previous section, which consists of two data frames: ratings and movies. The rating data refers to films and users only by ID number. The movies data frame provides the Title and Genres of each movie.

The top picks for each genre are displayed on a scrollable grid.

Scheme A: Popularity for Movies by Genre

Scheme A uses popular opinion to choose movies within a genre. The model averages the review ratings for each film and recommends the films with the highest average rating in each genre. This scheme performs the following steps.

Training

  1. Center the reviewer ratings to account for differences in how reviewers interpret the 5-star scale. This adjusts for divergent scales for grading bad and good movies. Centering movies across years to remove annual variations was tested but omitted because it had no impact.
  2. Calculate each film’s Average overall review rating (genre not considered).
  3. Filter movies with fewer than 100 reviews (this is a popularity contest) or with below-average ratings (< 0 after centering).
  4. Group films into lists for each genre and slice the N highest-rated films in each. Films with multiple genres get duplicated in each applicable list.
  5. Format top-N genre lists for output and add URL for movie artwork.
Recommending

  1. The model calculates the top-N lists for all genres. The application provides a genre as input and displays the appropriate top-N list.
set.seed(3814)

popular_movies = ratings %>% 
  # Center user ratings
  group_by(UserID) %>% 
  mutate(Rating = Rating - mean(Rating)) %>%
  ungroup() %>%
  
  # Summarize reviews by Movie (average of ratings)
  group_by(MovieID) %>%
  summarize(num_reviews = n(), 
            Rating = mean(Rating)) %>%
  filter(num_reviews > 100, Rating > 0) %>% 
  ungroup() %>%

  # Add movie details for Title and Genres
  left_join(movies, by="MovieID") %>%

  # Duplicate film entries for each of their genres
  separate_rows(Genres, sep = '[|]') %>% 
  arrange(Genres, Rating)

# Create separate Top-10 lists for each genre
popular_recs = popular_movies %>% 
  group_by(Genres, Title) %>% 
  summarize(.groups='drop_last', MovieID = first(MovieID), 
            Rating = mean(Rating)) %>% 
  group_by(Genres) %>% 
  slice_max(order_by = Rating, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  
  # Add urls for movie artwork
  mutate(Image = paste0(small_image_url, MovieID, '.jpg?raw=true'))

The recommendations are shown below and include mostly blockbuster movies that most users would recognize.

# Display Results
movieGrid(popular_recs, "Genre Recommendations Based on Popularity")
[Interactive grid: “Genre Recommendations Based on Popularity” — scrollable top-10 lists for each of the 18 genres (Action through Western), with rollover details for each movie.]
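For readers outside R, the Scheme A pipeline (center per user, average per movie, take the top N per genre) can be mirrored in pandas. This toy sketch uses invented ratings and, for brevity, skips the 100-review filter:

```python
import pandas as pd

ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 2, 2],
    "MovieID": [10, 20, 10, 20, 30],
    "Rating":  [5, 3, 4, 2, 3],
})
genres = pd.DataFrame({"MovieID": [10, 20, 30],
                       "Genres": ["Action", "Action", "Comedy"]})

# Center each user's ratings (R: group_by(UserID) %>% mutate(Rating - mean))
ratings["Rating"] -= ratings.groupby("UserID")["Rating"].transform("mean")

# Average centered ratings per movie, attach genres, keep the top movie per genre
top = (ratings.groupby("MovieID", as_index=False)["Rating"].mean()
       .merge(genres, on="MovieID")
       .sort_values("Rating", ascending=False)
       .groupby("Genres").head(1))
```

Here movie 10 wins Action (centered average 1.0) and movie 30 wins Comedy (0.0), matching what the R pipeline would produce for the same toy data.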


Scheme B: “Expert Opinion” on Movies by Genre

Scheme B diverges from general popularity and analyzes how “genre experts” rate the films. In this case, a genre expert is a person who has written many reviews for a particular genre. The expectation is that this approach can recommend some excellent but more ‘sophisticated’ movies. This scheme performs the following steps.

Training

  1. Center the reviewer ratings to account for differences in how they interpret the 5-star scale. This adjusts for divergent scales for grading bad and good movies. Centering movies across years to remove annual variations was tested but omitted because it had no impact.
  2. Add (join) movie details (e.g., Title, Genre) to rating data.
  3. Duplicate reviews into their applicable genre lists. This allows a review to appear in multiple genre lists.
  4. Create a list of reviews for each genre (ds_ratings)
  5. Create a “panel of experts” (genre_experts) for each genre from ds_ratings. The panel consists of the users with the most reviews for a particular genre. Users with over 500 reviews are omitted because we suspect bots or professional reviewers. The top 50 experts by number of reviews for the genre are selected for the panel. The table below profiles each genre’s panel. Note that the target panel size is 50, but some have more because of ties in review counts.
  6. The code below implements steps 1–5 and summarizes the expert panels.

    set.seed(3814)
    
    # Create lists of movie ratings by genre
    ds_ratings = ratings %>% 
      # Center user ratings
      group_by(UserID) %>% 
      mutate(Rating = Rating - mean(Rating)) %>%
      ungroup() %>%
      
      # Add movie info (title, genres) 
      left_join(movies, by="MovieID") %>%
      mutate(rev_yr = year(as.Date(as.POSIXct(Timestamp, origin="1970-01-01")))) %>%
      
      # Create duplicate reviews for each applicable genre 
      separate_rows(Genres, sep = '[|]') 
      
    # Find the "experts" for each genre - most reviews for a genre
    genre_experts = ds_ratings %>%
      group_by(Genres, UserID) %>%
      summarize(.groups='drop_last', Num_reviews = n()) %>% 
      filter(Num_reviews < 500) %>%
      slice_max(order_by = Num_reviews, n = 50)
      
    kable(genre_experts %>% 
            group_by(Genres) %>% 
            summarize(.groups='drop',
                      Experts = n(), 
                      Ave=round(mean(Num_reviews),0), 
                      Max= max(Num_reviews), 
                      Min=min(Num_reviews)),
          'html', table.attr = "style='width:50%;'",
          caption = "Genre Experts") %>%
      
      kable_styling(bootstrap_options = c("striped", "hover"), 
                    font_size = 12,
                    full_width = FALSE) %>%
      add_header_above(c(" " = 2, "Reviews Per Expert" = 3)) %>%
      scroll_box(width = "800px", height = "350px")
    Genre Experts

                          Reviews Per Expert
    Genres       Experts    Ave    Max    Min
    Action            51    288    368    257
    Adventure         50    149    192    132
    Animation         52     66     88     57
    Children’s        50    127    187    105
    Comedy            50    405    490    363
    Crime             50     94    152     81
    Documentary       60     22     41     16
    Drama             50    400    493    345
    Fantasy           57     42     54     38
    Film-Noir         57     26     42     22
    Horror            50    173    274    135
    Musical           54     61     92     53
    Mystery           53     55     99     43
    Romance           51    181    391    148
    Sci-Fi            52    172    225    151
    Thriller          51    238    378    194
    War               56     72    124     60
    Western           51     38     57     30
  7. Create a list of expert reviews for movies in each genre by left joining dataframe ds_ratings to genre_experts.
  8. Summarize expert review data by taking the mean rating for each movie. Remove movies with below-average ratings (< 0 after centering).
  9. Create the final recommendation list for each genre by slicing off the top-N rated movies.
  10. Format top-N genre lists for output, and add URL for movie artwork.

Recommending

  1. The model calculates the top-N lists for all genres. The application provides a genre as input and displays the appropriate top-N list.
# Create a list of reviews for each genre written by the genre experts 
expert_recs = genre_experts %>%
  left_join(ds_ratings, by=c("Genres", "UserID")) %>%
  
  # Calculate the mean Rating for each movie in a genre list
  group_by(Genres, MovieID) %>%
  summarize(.groups='drop_last',
            Title = first(Title), 
            Rating = mean(Rating), 
            Num_reviews=n()) %>%
  
  # Remove below average movies
  filter(Rating > 0) %>%
  
  # Pick top N rated movies in each list
  slice_max(order_by = Rating, n = 10, with_ties = FALSE) %>%
  
  # Add URL for artwork for output
  mutate(Image = paste0(small_image_url, 
                        MovieID, 
                        '.jpg?raw=true'))

# Display the results on grid below
movieGrid(expert_recs, "Genre Recommendations Based on Expert Opinion")
[Interactive grid: “Genre Recommendations Based on Expert Opinion” — scrollable top-10 lists for each of the 18 genres (Action through Western), with rollover details for each movie.]

System I Scheme Comparison

The two schemes provide different Top-5 lists, but some genres overlap more than others; the drama, documentary, and thriller genres differ the most. The top-10 lists show more overlap. Both models returned blockbuster movies with no small independent films, which was expected for the Popularity model. The Expert Opinion model would benefit from more specific classification of experts, such as by sex and age, to provide more targeted recommendations. However, user demographics were not part of the data set.

System II: Collaborative recommendation systems

In this section, collaborative models use similarity measures to determine recommendations.

Data Set Description

For System II, the training data set is an [n x m] (500 users x 3706 movies) realRatingMatrix, a sparse matrix that contains each user’s rating for each movie. An NA value is assigned if a user has not reviewed the movie. The test set consists of a single user’s ratings ([1 x 3706] realRatingMatrix). The matrices are created using the script provided with the assignment.
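To illustrate the sparse layout, here is a toy Python/NumPy sketch (not the R realRatingMatrix itself): unrated entries are marked NA, and per-user statistics must skip them, as the "center" normalization used below does:

```python
import numpy as np

# Toy 3-user x 4-movie matrix standing in for the realRatingMatrix;
# np.nan marks movies a user has not reviewed (NA in R)
R = np.array([
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 4.0, np.nan, 2.0],
    [1.0, 2.0, np.nan, np.nan],
])

# Per-user means ignore unrated entries
user_means = np.nanmean(R, axis=1)      # [4.0, 3.0, 1.5]
R_centered = R - user_means[:, None]    # center each user's row; NAs stay NA
```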


System II: User Based (UBCF) Model

The UBCF model creates recommendations based on ratings by reviewers whose preferences are similar to those of the requester. In application, the requester provides ratings for a handful of movies, which are then matched with reviewers with similar preferences. The model predicts ratings using the k nearest neighbors based on the cosine distance between reviewers and the requester. Finally, the requester receives a list of new movies based on the preferences of similar reviewers.

Training/Prediction

The training and prediction processes are performed together and follow the outline from the assignment document.

  1. Center ratings in the train and test matrices for each reviewer (row).
  2. Calculate the cosine distance of the ratings by reviewers from the ratings of the requester.
  3. Transform the distances [-1,1] to a similarity scale [0,1], sort, and create a matrix of the N nearest neighbors ([N x # of movies]).
  4. Calculate a vector [1 x movies] with the weighted average of nearest neighbor movie ratings and their similarity to the requester using the formula below.

    \[\hat{r}_{al} = \frac{1}{\sum_{i\in N(a)} s_{ai}}\sum_{i \in N(a)}s_{ai}r_{il}\]

  5. Set ratings to NA for movies supplied by the requester in the initial query.
  6. Create the predicted ratings by adding back the means to the vector from the previous step.
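The weighted-average step can be checked on toy numbers. The following Python/NumPy sketch mirrors the formula above (the neighbor ratings and similarities are invented; the R implementation follows):

```python
import numpy as np

# Centered ratings of two nearest neighbors for 3 movies (toy values)
T = np.array([
    [1.0,    0.5, np.nan],   # neighbor 1
    [np.nan, 1.0, -0.5],     # neighbor 2
])
S = np.array([0.9, 0.6])     # similarity of each neighbor to the requester

# Similarity-weighted average, summing only over neighbors that rated the movie
has_rating = ~np.isnan(T)
num = np.nansum(T * S[:, None], axis=0)
den = (has_rating * S[:, None]).sum(axis=0)
r_hat = num / den

# Movie 2: (0.5*0.9 + 1.0*0.6) / (0.9 + 0.6) = 0.7
print(r_hat)
```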
set.seed(3814)

nn = 20
d_train = normalize(train)
d_test = normalize(test)
train_mat = as(d_train, "matrix")
test_mat = as(d_test, "matrix")

# nn nearest neighbors = reviewers(train_mat) to requester(test_mat)
m_cosine = simil(train_mat, test_mat, method = "cosine")
sim = (m_cosine + 1) / 2
nn_idx = order(sim, decreasing = TRUE)[1:nn]   # indices of the nn nearest neighbors


# Create matrix of Top nn from train
T = train_mat[nn_idx,]  # matrix of nearest neighbors (rows)
S = sim[nn_idx]     # similarity of nearest neighbors (rows)

# Mask of neighbor ratings present for each movie
has_rating = !is.na(T)

# Similarity-weighted average of neighbor ratings (sum only over neighbors who rated)
T = colSums(T * S, na.rm = TRUE) / colSums(has_rating * S, na.rm = TRUE)

# Zero out movies the requester already rated, then restore NAs and add back user means
T = T * (is.na(test_mat))
mypred = ifelse(T == 0, NA, T + d_test@normalize$row$factors$means)

UBCF Comparison to RecommenderLab Package

The UBCF code closely matched the RecommenderLab results and met the full requirements of the assignment.

recommender.UBCF <- Recommender(train, method = "UBCF",
                                parameter = list(normalize = 'center', 
                                                 method = 'Cosine', 
                                                 nn = 20))

p.UBCF <- predict(recommender.UBCF, test, type="ratings")
p.UBCF <- as.numeric(as(p.UBCF, "matrix"))

NA_diff = sum(is.na(p.UBCF) != is.na(mypred)) ### should be zero
Sum_diff = max(abs(p.UBCF - mypred), na.rm = TRUE)  ### should be less than 1e-06 

kable(data.frame("UBCF Comparison Tests"= c('NA Response Difference', 
                                       'Sum of Response Differences'),
                 "Results"=c(as.character(NA_diff), 
                             sprintf("%1.4e", (Sum_diff))),
                 "Goal"= c("0", "< 1e-06")),
      align=rep('c', 2),
      caption="UBCF Comparison to RecommendLab",
      'html', table.attr = "style='width:50%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
              font_size = 12,
              full_width = FALSE)

UBCF Comparison to RecommendLab

UBCF Comparison Tests          Results      Goal
NA Response Difference         0            0
Sum of Response Differences    8.4676e-07   < 1e-06



System II: Movie Based (IBCF) Model Design

The IBCF model uses a list of movies as input and returns predicted ratings of other movies based on their similarity to the input list. This model differs from UBCF in that it calculates the similarity between movies rather than between reviewers. First, the model creates an [m x m] similarity matrix using the training data. Similarity is determined by the k nearest neighbors based on the cosine distance between the movies’ review vectors. Then, the model returns new movies to the requester based on these similar movies. The source code is shown below.

nn = 30
train_mat = as(normalize(train), "matrix")

# similarity matrix M x M
sim_mat =  as(simil(t(train_mat), t(train_mat), method = "cosine"), "matrix")
sim_mat = (sim_mat + 1) / 2
diag(sim_mat) = NA

# Keep the nn nearest neighbors
for(i in 1:nrow(sim_mat)) {
  sidx = tail(order(sim_mat[i, ], decreasing = FALSE, na.last = FALSE), nn)
  sim_mat[i, -sidx] <- NA
}
# Make into a sparse matrix to better track NAs
sim_mat = dropNA(sim_mat)  

# Format active user movie preferences (test)
test_mat = as(test, "dgCMatrix")
test_mask = !is.na(as(test, "matrix"))   # create mask for test_mat rows != NA

# Weighted average of similarities by active user input (test)
pred = (sim_mat %*% t(test_mat)) / (sim_mat %*% t(test_mat != 0))
mypred = as(t(pred), "matrix")

# remove user movie pick in active user input (test) 
mypred[test_mask] = NA 

Training/Prediction

The training and prediction processes are performed together and follow the outline from the assignment document.

  1. Center ratings in the training matrix for each reviewer (row).
  2. Calculate an [m x m] matrix containing the cosine distance between each pair of movies, based on the ratings by reviewers in the training set.
  3. Transform the distance scale [-1,1] to a similarity scale [0,1], sort, and create a matrix of the N nearest neighbors for each movie.
  4. Predict the requester’s rating of movies based on the weighted average of movie similarities to the input list of their movie ratings. The formula is shown below.

    \[\hat{r}_{al} = \frac{1}{\sum_{i\in S(l)} s_{li}}\sum_{i \in S(l)}s_{li}r_{ai}\]

  5. Set ratings to NA for movies supplied by the requester in the initial query.
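The item-based formula can likewise be checked on toy numbers (a Python/NumPy sketch; the similarities and ratings are invented):

```python
import numpy as np

# Predict the requester's rating of movie l from the movies i they rated,
# weighted by the movie-movie similarities s_li (toy values)
s_li = np.array([0.8, 0.5, 0.7])   # similarity of movie l to rated movies i in S(l)
r_ai = np.array([1.0, -0.5, 0.5])  # requester's centered ratings of those movies

r_hat_al = (s_li * r_ai).sum() / s_li.sum()
# (0.8*1.0 - 0.5*0.5 + 0.7*0.5) / (0.8 + 0.5 + 0.7) = 0.9 / 2.0 = 0.45
print(r_hat_al)
```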

Question: why do we encounter such a big discrepancy for IBCF, but not for UBCF?

The cosine distance algorithm differentiates between user preference similarities (UBCF) much better than between movie similarities. For IBCF, most movies had similarity scores identical to those of many (often hundreds of) other movies, so there was not enough differentiation to choose a definitive set. Different model implementations are therefore likely to choose different movies, depending on how they break ties when picking the top 30 most similar films (nearest neighbors). To match the RecommenderLab results, the KNN selection had to match precisely.
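The tie-breaking point can be demonstrated directly (a Python/NumPy sketch with invented similarity scores): when many similarities are exactly equal, the chosen top-k depends on the tie-breaking rule, so two correct implementations can pick different neighbor sets with identical scores.

```python
import numpy as np

sim = np.array([0.9, 0.7, 0.7, 0.7, 0.4])  # three movies tied at 0.7

# Stable descending sort keeps the *first* tied entries: indices 1 and 2
first_ties = np.argsort(-sim, kind="stable")[:3]

# An equally valid rule keeps the *last* tied entries: indices 2 and 3
last_ties = len(sim) - 1 - np.argsort(-sim[::-1], kind="stable")[:3]

# Both selections have the same similarity scores, but are different movies
assert sorted(sim[first_ties]) == sorted(sim[last_ties])
print(set(first_ties.tolist()), set(last_ties.tolist()))
```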

IBCF Comparison to RecommenderLab Package

recommender.IBCF <- Recommender(train, method = "IBCF",
                                parameter = list(normalize = 'center', 
                                                 method = 'Cosine', 
                                                 k = 30))

p.IBCF <- predict(recommender.IBCF, test, type="ratings")
p.IBCF <- as.numeric(as(p.IBCF, "matrix"))

## first output: should be less than 10
NA_diff = sum(is.na(p.IBCF) != is.na(mypred))  

## second output: should be less than 10%
mydiff = abs(p.IBCF - mypred)
Sum_diff = sum(mydiff[!is.na(mydiff)] > 1e-6) / sum(!is.na(mydiff)) 

kable(caption="IBCF Comparison to RecommendLab",
      data.frame("IBCF Comparison Test"= c('NA Response Difference', 
                                       'Sum of Response Differences'),
                 "Results"=c(as.character(NA_diff), 
                             sprintf("%1.4e", Sum_diff)),
                 "Goal"= c("< 10", "< 10%")),
      align=rep('c', 2),
      'html', table.attr = "style='width:50%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
              font_size = 12,
              full_width = FALSE)

IBCF Comparison to RecommendLab

IBCF Comparison Test           Results      Goal
NA Response Difference         0            < 10
Sum of Response Differences    0.0000e+00   < 10%